[SPARK-54452] Fix empty response from SparkConnect server for spark.sql(...) inside FlowFunction
#53156
Conversation
dongjoon-hyun
left a comment
Hi, @SCHJonathan. FYI, the Apache Spark community does not use this kind of tag, '[bugfix]', in PR titles.
fyi @vicennial
cc @sryza
dongjoon-hyun
left a comment
If you don't mind, please file a new JIRA issue as a bug fix instead of reusing SPARK-54020, @SCHJonathan.
...test/scala/org/apache/spark/sql/connect/pipelines/SparkDeclarativePipelinesServerSuite.scala
test(
  "SPARK-54452: spark.sql() outside a pipeline flow function should return a " +
    "sql_command_result") {
Why do we expect the same behavior? According to the log, it is guarded by the insidePipelineFlowFunction condition, isn't it?
Lines 2993 to 2996 in da7389b
if (insidePipelineFlowFunction) {
  result.setRelation(relation)
  return
}
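The branching the reviewer points at can be modeled with a small sketch. This is plain Python with illustrative names (`Response`, `handle_sql_command`), not Spark Connect's actual server API:

```python
# Illustrative model of the guard quoted above: inside a pipeline flow
# function the server returns the raw relation (logical plan) unexecuted;
# outside a flow function, the SQL is executed eagerly and a command
# result is returned instead.
class Response:
    def __init__(self):
        self.relation = None
        self.sql_command_result = None

def handle_sql_command(relation, inside_pipeline_flow_function):
    result = Response()
    if inside_pipeline_flow_function:
        # Mirrors the guarded branch: result.setRelation(relation); return
        result.relation = relation
        return result
    # Eager path: execute and attach a sql_command_result.
    result.sql_command_result = f"executed({relation})"
    return result
```

So the two test cases exercise different branches of the same handler, which is why only one of them is expected to carry a sql_command_result.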
proto.Command
  .newBuilder()
  .setSqlCommand(
    proto.SqlCommand
May I ask why we use a different setSqlCommand in the two test cases? Can we use the following, like the other test case, in the same way?
.setSqlCommand(
proto.SqlCommand
.newBuilder()
.setInput(proto.Relation
.newBuilder()
.setSql(proto.SQL.newBuilder().setQuery("SELECT * FROM RANGE(5)"))
.build())
.build())
.build())
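The nested fluent-builder style suggested above can be mimicked with plain Python stand-ins. This is a hedged sketch of the builder idiom only, not the real generated protobuf classes:

```python
# Toy fluent builder mirroring the proto.SqlCommand / Relation / SQL
# nesting shown above. Each set() returns self so calls chain, and
# build() freezes the current state into a plain dict.
class Builder:
    def __init__(self, kind):
        self._kind = kind
        self._fields = {}

    def set(self, name, value):
        self._fields[name] = value
        return self  # returning self enables method chaining

    def build(self):
        return {"kind": self._kind, **self._fields}

def sql(query):
    return Builder("SQL").set("query", query).build()

def relation(inner_sql):
    return Builder("Relation").set("sql", inner_sql).build()

def sql_command(input_relation):
    return Builder("SqlCommand").set("input", input_relation).build()
```

Using one nesting shape in both test cases keeps the two assertions directly comparable.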
Gentle ping, @SCHJonathan.
Closes #53156 from SCHJonathan/jonathan-chang_data/fix-spark-sql-bug.
Authored-by: Yuheng Chang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 997525c)
Signed-off-by: Dongjoon Hyun <[email protected]>
Merged to master/4.1.
What changes were proposed in this pull request?
In PR #53024, we added SDP support for `spark.sql(...)` inside a FlowFunction. For these calls, instead of eagerly executing the SQL, the Spark Connect server should return the raw logical plan to the client and defer execution to the flow function. However, in that PR we constructed the response object but forgot to actually return it to the Spark Connect client, so the client received an empty response.
This went unnoticed in tests because, when the client sees an empty `spark.sql(...)` response, it falls back to creating an empty DataFrame holding the raw logical plan, which happens to match the desired behavior. This PR fixes the bug by returning the proper response instead of relying on that implicit fallback.
Why are the changes needed?
Does this PR introduce any user-facing change?
This PR fixes a bug introduced in #53024 where the server did not return the constructed spark.sql(...) response to the client.
How was this patch tested?
New tests
Was this patch authored or co-authored using generative AI tooling?
No
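The interaction described above, a server bug masked by the client's fallback, can be sketched with hypothetical helpers. The names `server_handle_sql` and `client_sql` are illustrative only, not pyspark's real functions:

```python
# Models the bug: the server built a response but never returned it, so
# the client saw an empty message and silently fell back to wrapping the
# raw logical plan itself, which coincidentally matched the intended
# behavior and kept the tests green.
def server_handle_sql(query, inside_flow_function, fixed):
    """Return the wire response as a dict; {} models an empty response."""
    if inside_flow_function:
        response = {"relation": query}  # raw logical plan, not executed
        return response if fixed else {}  # bug: built but never sent
    return {"sql_command_result": f"executed({query})"}

def client_sql(query, inside_flow_function, fixed):
    resp = server_handle_sql(query, inside_flow_function, fixed)
    if not resp:
        # Client fallback: empty response -> DataFrame that just holds
        # the raw logical plan, deferring execution.
        return {"relation": query, "via_fallback": True}
    return {**resp, "via_fallback": False}
```

In this model the buggy and fixed servers produce the same observable plan for flow-function calls, differing only in whether the client's implicit fallback was needed, which is why a dedicated test had to assert on the response itself.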